Introduction to Private Data Preparation in RAG
AI010 Lesson 7

The RAG Foundation

Standard Large Language Models (LLMs) are "frozen" in time, limited by their training data cut-off. They cannot answer questions about your company's internal handbook or a private video meeting from yesterday. Retrieval Augmented Generation (RAG) bridges this gap by providing the LLM with relevant context retrieved from your own private data.

The Multi-Step Workflow

To make private data "readable" for an LLM, we follow a specific pipeline:

  • Loading: Converting various formats (PDF, Web, YouTube) into a standard document format.
  • Splitting: Breaking long documents into smaller, manageable "chunks."
  • Embedding: Converting text chunks into numerical vectors (mathematical representations of meaning).
  • Storage: Saving these vectors in a Vectorstore (like Chroma) for lightning-fast similarity searching.
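
The four steps above can be sketched end to end in plain Python. This is a toy stand-in, not the real pipeline: the chunks, the bag-of-words "embedding," and the in-memory list are hypothetical simplifications of what a document loader, a neural embedding model, and a vectorstore like Chroma actually do.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words term counts.
    Real systems use dense neural embedding vectors instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Loading + Splitting: pretend these chunks came from a loader and splitter.
chunks = [
    "Employees accrue 20 vacation days per year.",
    "The office is closed on public holidays.",
    "Remote work requires manager approval.",
]

# Storage: a minimal in-memory "vectorstore" pairing each chunk with its vector.
store = [(chunk, embed(chunk)) for chunk in chunks]

# Retrieval: embed the question and rank stored chunks by similarity.
query = embed("How many vacation days do employees get?")
best = max(store, key=lambda pair: cosine(query, pair[1]))
print(best[0])  # the most relevant chunk is what gets sent to the LLM
```

Running this prints the vacation-policy chunk, since it shares the most vocabulary with the question; a real embedding model captures meaning rather than exact word overlap, but the store-then-search shape is the same.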
Why Chunking Matters

LLMs have a "context window" (a limit on how much text they can process at once). If you send a 100-page PDF in one request, it will exceed that limit and the call will fail. We split data into chunks so that only the most relevant pieces of information are retrieved and sent to the model.
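
A minimal character-based splitter makes the two key parameters concrete. The function below is a sketch of the sliding-window idea behind splitters like RecursiveCharacterTextSplitter, not their actual implementation; the parameter names `chunk_size` and `chunk_overlap` match the LangChain convention.

```python
def split_text(text, chunk_size=100, chunk_overlap=20):
    """Slide a window across the text. The overlap repeats the tail of
    each chunk at the start of the next, so a thought that straddles a
    boundary still appears whole in at least one chunk."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Stand-in for a long document: a 250-character repeating alphabet.
doc = "".join(chr(97 + i % 26) for i in range(250))
chunks = split_text(doc, chunk_size=100, chunk_overlap=20)

print(len(chunks))                      # number of chunks produced
print(chunks[0][-20:] == chunks[1][:20])  # the overlap is shared verbatim
```

Each chunk now fits comfortably inside a context window, and the shared 20-character overlap means no boundary ever severs text from both chunks at once.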
Question 1

Why is chunk_overlap considered a critical parameter when splitting documents for RAG?

  • To reduce the total number of tokens used by the LLM.
  • To ensure that semantic context (the meaning of a thought) is not cut off at the end of a chunk.
  • To make the vector database store data faster.
Challenge: Preserving Context
Apply your knowledge to a real-world scenario.
You are loading a YouTube transcript for a technical lecture. You notice that the search results are confusing "Lecture 1" content with "Lecture 2."
Task
Which splitter would be best for keeping context like "Section Headers" intact?
Solution:
MarkdownHeaderTextSplitter or RecursiveCharacterTextSplitter. These allow you to maintain document structure in the metadata, helping the retrieval system distinguish between different chapters or lectures.
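
The idea can be sketched in a few lines. This is a simplified, hypothetical illustration of what a header-aware splitter such as MarkdownHeaderTextSplitter does, not its real implementation: each piece of content carries its section header as metadata, so retrieval can tell Lecture 1 apart from Lecture 2.

```python
import re

def split_by_headers(markdown):
    """Split markdown on '#' headers, attaching the current header to
    each content line as metadata (sketch of header-aware splitting)."""
    docs, header = [], None
    for line in markdown.splitlines():
        m = re.match(r"#+\s+(.*)", line)
        if m:
            header = m.group(1)       # remember the section we are in
        elif line.strip():
            docs.append({"page_content": line.strip(),
                         "metadata": {"header": header}})
    return docs

transcript = """# Lecture 1
Gradient descent minimizes a loss function.
# Lecture 2
Backpropagation computes gradients efficiently."""

for doc in split_by_headers(transcript):
    print(doc["metadata"]["header"], "->", doc["page_content"])
```

With the header stored in metadata, a retrieval system can filter or rank results by section, which prevents the "Lecture 1 vs. Lecture 2" confusion described in the challenge.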